Genetic effects of sequence-conserved enhancer-like elements on human complex traits

Background The vast majority of findings from human genome-wide association studies (GWAS) map to non-coding sequences, complicating their mechanistic interpretations and clinical translations. Non-coding sequences that are evolutionarily conserved and biochemically active could offer clues to the mechanisms underpinning GWAS discoveries. However, genetic effects of such sequences have not been systematically examined across a wide range of human tissues and traits, hampering progress to fully understand regulatory causes of human complex traits. Results Here we develop a simple yet effective strategy to identify functional elements exhibiting high levels of human-mouse sequence conservation and enhancer-like biochemical activity, which scales well to 313 epigenomic datasets across 106 human tissues and cell types. Combined with 468 GWAS of European (EUR) and East Asian (EAS) ancestries, these elements show tissue-specific enrichments of heritability and causal variants for many traits, which are significantly stronger than enrichments based on enhancers without sequence conservation. These elements also help prioritize candidate genes that are functionally relevant to body mass index (BMI) and schizophrenia but were not reported in previous GWAS with large sample sizes. Conclusions Our findings provide a comprehensive assessment of how sequence-conserved enhancer-like elements affect complex traits in diverse tissues and demonstrate a generalizable strategy of integrating evolutionary and biochemical data to elucidate human disease genetics. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-03142-1.


Supplementary Notes Note S1: Connection between the original RSS-NET and the extension
The RSS-NET extension in the present study is closely related to the original RSS-NET model.Both models are built on the same multiple-SNP regression likelihood for GWAS summary statistics.Furthermore, the prior distribution used in the RSS-NET extension is a specific instance of the prior used in the original RSS-NET model, as demonstrated below.
In the original RSS-NET model, the prior distribution is given by where β j is the true effect of each SNP j, π j denotes the probability that SNP j is associated with the trait (β j ̸ = 0), N (µ j , σ 2 0 ) denotes a normal distribution with mean µ j and variance σ 2 0 specifying the effect size of a trait-associated SNP j, δ 0 denotes the point mass at zero (β j = 0), ℓ j = 1 if SNP j is within 100 kb of a given regulatory network and 0 otherwise, O j is the set of all nearby or distal genes contributing to the total effect of SNP j, w j g measures the relevance between SNP j and gene g, γ j g denotes the random effect of SNP j on a trait due to gene g, and {θ 0 , θ, σ 0 , σ} are hyper-parameters capturing baseline and enrichment effects.
After integrating out the random effects {γ j g }, we can rewrite the prior above as follows: Lastly, we set ℓ j = a j and g∈O j w 2 j g = a j , resulting in the prior used in the RSS-NET extension: where a j = 1 if SNP j falls inside HC ELEs and 0 otherwise.Due to this close connection between the original RSS-NET and the present extension, we can reuse the efficient algorithm developed for the original RSS-NET to fit the extension model.

Note S2: Details of the simulations to assess the RSS-NET extension
We used real genotypes of 348965 genome-wide SNPs on chromosomes 1-22 from 1458 individuals in the UK Blood Service Control Group to simulate phenotype data, and then computed the standard single-SNP association summary statistics.On these summary data, we compare the results of RSS-NET extension with the ground truth of each simulation scenario.
We used the HC omnibus ELEs to create binary annotations for 348965 genome-wide SNPs.Specifically, we let a j = 1 if SNP j falls inside the HC omnibus ELEs, and let a j = 0 otherwise.There are 19335 SNPs mapped to the HC omnibus ELEs with annotation a j = 1.
To ensure comparable proportions of true genetic signals across simulated datasets in a given scenario, we simulated causal indicators for negative and positive datasets in a paired way.We first simulated the causal indicator ζ j of each SNP j in a positive dataset as: ζ j ∼ Bernoulli(π j ) and π j = 1 + 10 −(θ 0 +a j •θ) −1 , where θ 0 and θ are the baseline and enrichment proportion parameters, respectively.We then counted the number of causal SNPs in this positive dataset as n c := j ζ j , and randomly chose n c SNPs as causal SNPs for the corresponding negative dataset.
For positive datasets, we simulated the true effect β j of each SNP j as where δ 0 denotes the point mass at zero, σ 2 0 and σ 2 are the baseline and enrichment magnitude parameters respectively, and a j denotes the annotation of SNP j based on HC omnibus ELEs.For negative datasets, we simulated β j from the same model above with σ 2 = 0.
To ensure comparable total genetic signals between negative and positive datasets, we simulated phenotypes by matching signal-to-noise ratios across simulated datasets.Specifically, we simulated phenotype y i of individual i as y i = p j=1 x i j • β j + ϵ i , ϵ i ∼ N (0, τ −1 ), where p = 348965 and x i j is the genotype of SNP j for individual i.Let y = [y i ] and X = [x i j ].The true value of residual variance τ −1 is determined by the total proportion of variance in phenotype y explained by all available SNPs in X: We set the true values of hyper-parameters for positive datasets as follows: baseline proportion parameter θ 0 ∈ {−2, −4}, enrichment proportion parameter θ = 2, baseline magnitude parameter σ 0 = 1, enrichment magnitude parameter σ ∈ {0, 2}, and PVE ∈ {0.2, 0.6}, resulting in a total of 8 unique simulation scenarios.For each scenario of {θ 0 , θ, σ 0 , σ, PVE}, we simulated 200 negative and 200 positive datasets independently.
We applied the RSS-NET extension to simulated datasets with the following hyper-parameter grids.The baseline parameters θ 0 and σ 0 were fixed as their true values.The grid on the enrichment proportion parameter θ was {0, 1.5, 1.75, 2, 2.25, 2.5}.The grid on the enrichment magnitude parameter σ was {0, 0.25, 0.5, 0.75, 1} if the true value of σ was zero; otherwise it was {0, 1.5, 1.75, 2, 2.25, 2.5}.For the same simulation scenario, we used the same hyper-parameter grid in the analysis of all negative and positive datasets.Posterior estimates of {θ, σ} are shown in Fig. S5.
We evaluated the performance of the RSS-NET extension for enrichment testing on the paired positive and negative datasets in each simulation scenario.For all datasets the target of enrichment testing is the full set of HC omnibus ELEs.A false positive occurs if a method identifies enrichment in a negative dataset.A true positive occurs if a method identifies enrichment in a positive dataset.We evaluated the performance of enrichment testing by plotting the receiver operating characteristic (ROC) curve and computing the area under the ROC curve (AUROC) for each simulation scenario.Both metrics are implemented in the R package plotROC.The results are shown in Fig. S6.
We evaluated the performance of the RSS-NET extension for association testing only on positive datasets in each simulation scenario.For each positive dataset, a HC ELE is defined as "traitassociated" if at lease one SNP j within this element has non-zero effect (β j ̸ = 0).In each simulated dataset, we quantified the strength of genetic association for each HC ELE by P H 1 , the posterior trait-associated probability evaluated under the model that HC ELEs are enriched for genetic associations with the trait.An association between a non-trait-associated HC ELE and the trait is a "false positive".An association between a trait-associated HC ELE and the trait is a "true positive".
We visualized the performance of the RSS-NET extension for association testing in two distinct ways.First, we plotted the ROC curve and computed the AUROC for each simulation scenario, as we did in the assessment of enrichment testing above.Unlike the balanced structure of enrichment testing (200 positive and 200 negative datasets in each scenario), the numbers of trait-associated HC ELEs (true labels) and non-trait-associated HC ELEs (false labels) differed a lot for most simulated datasets.Due to this imbalanced structure, we additionally plotted the precision-recall (PRC) curve and computed the area under the PRC curve (AUPRC) for each simulation scenario.Both metrics are implemented in the R package precrec.The results are shown in

Note S3: Hyper-parameter inference for the RSS-NET extension
The RSS-NET extension consists of four unknown hyper-parameters {θ 0 , θ, σ 2 0 , σ 2 }.We followed the approach from the original RSS-NET publication to infer these unknown hyper-parameters.Specifically, we first introduced two free parameters {η, ρ} to re-parameterize {σ 2 0 , σ 2 }: , where η represents the proportion of the total phenotypic variation explained by p SNPs, and ρ represents the proportion of total genetic variation explained by SNPs with a given annotation (a j = 1).We then placed independent uniform grid hyper-priors on {θ 0 , θ, η, ρ}.
When applying the RSS-NET extension to the GWAS of BMI and schizophrenia, we used the following grid hyper-priors: The posterior estimates of hyper-parameters {θ 0 , θ, σ 2 0 , σ 2 } help explain the larger number of genetic associations for BMI and schizophrenia identified under the enrichment model for HC ELEs, compared to the baseline model.On the same GWAS, the RSS-NET extension gives similar estimates for the baseline parameters {θ 0 , σ 2 0 } under both the baseline and enrichment models, and it produces positive estimates for the enrichment parameters {θ, σ 2 } if HC ELEs are identified as enriched for genetic associations with BMI (Fig. 5a) and schizophrenia (Fig. 6a).The positive estimate of θ increases π j = 1 + 10 −(θ 0 +a j •θ) −1 , the prior probability of association for SNPs in HC ELEs, which in turn increases the posterior probability of association for these SNPs.The positive estimate of σ 2 increases σ 2 j = σ 2 0 + a j • σ 2 , the prior effect size for trait-associated SNPs in HC ELEs, which in turn increases the posterior effect size for these SNPs.Collectively, the positive enrichment parameter estimates enhance the evidence for associations between SNPs in HC ELEs and the GWAS trait, leading to the identification of more trait-associated elements under the enrichment model for HC S9; Table S15).

Note S4: Quality control for reference LD estimates in the RSS-NET extension
Built upon the multiple-SNP regression likelihood for GWAS summary statistics, the RSS-NET extension requires the input of reference LD estimates that align with GWAS summary statistics.In the present study, we have executed a series of quality control (QC) procedures to ensure consistency between GWAS summary statistics and reference LD estimates when running the RSS-NET extension.The following table lists the major sources of GWAS-LD inconsistencies to date, as well as the specific QC procedures that we have implemented in this study to address these issues.

Source of potential GWAS-LD mismatches QC procedure used in the present study
Differences in population or ancestry between the sample to generate GWAS summary statistics and the sample to derive reference LD estimates.
Since the GWAS summary statistics of BMI and schizophrenia analyzed in this study were generated from European ancestry cohorts, we estimated LD using the haplotypes of unrelated individuals with European ancestry from Phase 3 of the 1000 Genomes Project.Errors in genotyping or imputation, as well as genetic variants with incorrectly specified effect alleles.
We used the HapMap3 SNPs as an approximation of the well-imputed genome-wide SNPs, and we further excluded SNPs on sex chromosomes, SNPs with minor allele frequency less than 1%, and SNPs in the major histocompatibility complex region to avoid potential LD mismatches caused by poorly genotyped or imputed SNPs and incorrect allele encodings.Inconsistent sample size coverage at different genetic variants when meta-analyzed cohorts are genotyped or imputed at variable density.
We excluded SNPs that were not analyzed in all individual cohorts in the GWAS meta-analysis, such as SNPs measured on custom arrays (e.g., Metabochip and Immunochip).'Allele flip' in the sense that alleles in the GWAS summary statistics are encoded differently from those in the reference LD estimates.
For each SNP analyzed, we checked whether its alleles were consistently encoded in both the GWAS summary statistics and the reference LD estimates.We manually adjusted SNPs that used different allele encodings prior to any downstream analysis.

Supplementary Table Legends
Table S1: Human epigenomic data.Accession numbers of H3K27ac and chromatin accessibility profiles for 106 human tissues and cell types.
Table S2: Motif enrichments of ELEs.For each set of ELEs (omnibus or context-specific, all or conserved), significant motif enrichments are identified by HOMER (≥ 2-fold and P ≤ 1 × 10 −12 ).For all ELEs without considering sequence conservation (NC), the background is set as randomly selected sequences that match the GC-content distribution of NC ELEs.For conserved ELEs (LC, MC, HC), the background is set as either (1) GC-matched random sequences or (2) NC ELEs in the same context.
Table S3: Pairwise correlations for ELE annotations of SNPs.For each set of ELEs (omnibus or context-specific, all or conserved), the binary SNP annotation is created for both EUR and EAS populations.For each population, two sets of genome-wide SNPs are used: 'all SNPs' (EUR: 9997231; EAS: 8768561); 'common SNPs' (EUR: 5961159; EAS: 5469053).For each pair of SNP annotations, the correlation is measured by Cramér's V , ranging from 0 (no correlation) to 1 (complete correlation).
Table S4: Correlations between ELE annotations and 96 known annotations of SNPs.The S-LDSC baselineLD model consists of 83 binary and 13 quantitative annotations of SNPs.The correlation between an ELE annotation and a binary S-LDSC annotation is measured by Cramér's V .The correlation between an ELE annotation and a quantitative S-LDSC annotation is measured by Pearson's R. The rest is the same as Table S3.
Table S6: Heritability enrichments of omnibus ELEs.S-LDSC results of SNP annotations based on all omnibus ELEs (NC) and subsets with varying levels of sequence conservation (low: LC; moderate: MC; high: HC) for 468 GWAS datasets.
Table S7: Meta-analyzed heritability enrichments of omnibus ELEs.S-LDSC results of SNP annotations based on all omnibus ELEs (NC) and subsets with varying levels of sequence conservation (low: LC; moderate: MC; high: HC) meta-analyzed in each of the 30 strata.
Table S8: Heritability enrichments of HC context-specific ELEs.S-LDSC results of HC context-specific ELEs for 468 GWAS and 105 contexts.
Table S9: Meta-analyzed heritability enrichments of HC context-specific ELEs.S-LDSC results of HC context-specific ELEs meta-analyzed in each of the 432 strata.
Table S10: Mouse epigenomic data.Accession numbers of H3K27ac profiles for 14 mouse contexts and human epigenomic data matched to each context.Table S11: Heritability enrichments of HC ELEs with H3K27ac conservation.S-LDSC results of SNP annotations based on NC context-specific ELEs, HC context-specific ELEs using humanmouse H3K27ac comparison, and HC context-specific ELEs using human-mouse genome comparison, for 379 GWAS and 14 contexts.
Table S12: Meta-analyzed heritability enrichments of ELEs with H3K27ac conservation.S-LDSC results of HC context-specific ELEs using human-mouse H3K27ac comparison and humanmouse genome comparison, meta-analyzed in each of the 34 strata.
Table S13: Fractions of fine-mapped SNPs in omnibus ELEs.Each row reports the following: #{fine-mapped SNPs that fall inside a region and have PIPs ≥ a threshold} #{fine-mapped SNPs that fall inside a region} , for a combination of GWAS dataset, fine-mapping method and PIP threshold (columns 1-3), where the 'region' denotes the whole genome (column 4), all omnibus ELEs (NC; column 5), or omnibus ELEs with varying levels of sequence conservation (low: LC; moderate: MC; high: HC; columns 6-8).
Table S14: Fractions of fine-mapped SNPs in HC context-specific ELEs.Each row reports the following fraction (column 6): #{fine-mapped SNPs that fall inside HC ELEs and have PIPs ≥ 0.1} #{fine-mapped SNPs that fall inside HC ELEs} , for a combination of GWAS dataset, fine-mapping method and context group (columns 1-3), together with a binomial P-value (column 7) testing if this fraction is greater than the fraction of all finemapped SNPs that have PIPs ≥ 0.1 (using the same GWAS data and fine-mapping method).
Table S15: Genetic associations of HC omnibus ELEs with BMI and schizophrenia.Each row represents a HC omnibus ELE (columns 1-3), and reports the posterior trait-associated probabilities estimated under the baseline model without any enrichment (P B 1 ), under the enrichment model for NC ELEs (P N 1 ), and under the enrichment model for HC ELEs (P H 1 ), based on the GWAS of BMI (columns 5-7) and schizophrenia (columns 8-10).'Neural' (column 4) indicates if the element contains any HC context-specific ELE derived from neural samples (1: yes; 0: no), as visualized in Fig. s S1f-i.
Table S16: Motif enrichments of HC ELEs associated with BMI.Each row represents either a de novo or known motif whose enrichment result is generated by HOMER.The target sequences are HC omnibus ELEs that are present in neural samples and are associated with BMI at P H 1 ≥ 0.9, as indicated in columns 4 and 7 of Table S15.The background sequences are HC omnibus ELEs that are present in neural samples and are associated with BMI at P H 1 ≤ 0.1.Table S17: Motif enrichments of HC ELEs associated with schizophrenia.The target and background sequences are HC omnibus ELEs that are present in neural samples and are associated with schizophrenia at P H 1 ≥ 0.9 and P H 1 ≤ 0.1, respectively; see columns 4 and 10 of Table S15.The rest are the same as in Table S16.

Table S18: Putative target genes of BMI-associated HC ELEs with MEIS1-binding motif.
Each row reports a published linkage between a likely target gene and a neural-related, BMIassociated HC omnibus ELE (P H 1 ≥ 0.9) that contains the enriched sequence motif recognized by MEIS1, together with the corresponding publication (PubMed ID).S18.Each row reports a Metascape pathway (columns 1-3), the number and fraction of target genes that belong to this pathway (columns 4-5), the log10 P-value (column 6) based on the hypergeometric distribution (with all genes in the genome as the enrichment background), and the log10 q-value (column 7) based on the Benjamini-Hochberg procedure to account for multiple testing.

Table S20
: Putative target genes of BMI-associated HC ELEs with DLX3-binding motif.All elements listed here contain instances of the enriched DLX3-binding motif.The rest are the same as in Table S18.S20.The legend is the same as the legend of Table S19.S18 and S20.Each row reports the identifiers of a gene in Mouse Genome Informatics (MGI), Online Mendelian Inheritance in Man (OMIM) and Therapeutic Target Database (TTD).A blank entry indicates the absence of a gene in a database.

Table S22: External database lookup of genes in Tables
Table S23: Putative target genes of schizophrenia-associated HC ELEs with POU3F3binding motif.The elements listed here are neural-related, schizophrenia-associated HC omnibus ELEs (P H 1 ≥ 0.9) that contain the enriched motif recognized by POU3F3.The rest are the same as in Table S18.
Table S24: Putative target genes of schizophrenia-associated HC ELEs with HAND1::TCF3binding motif.The elements listed here are neural-related, schizophrenia-associated HC omnibus ELEs (P H 1 ≥ 0.9) that contain the enriched HAND1::TCF3 motif.The rest are the same as in Table S18.S23 and S24.The legend is the same as Table S22.

Table S25: External database lookup of genes in Tables
Table S26: GWAS Catalog lookup.The column heading description is available in the GWAS Catalog (https://www.ebi.ac.uk/gwas/docs/fileheaders).6) Immune ( 10) Kidney ( 5) Lung ( 4) Mental ( 11) Metabolic ( 12) Neurological ( 4) Reproductive ( 4 6) Immune ( 10) Kidney ( 5) Lung ( 4) Mental ( 11) Metabolic ( 12) Neurological ( 4  For each trait and each of the 14 contexts with mouse H3K27ac ChIP-seq data available (Table S10), we estimate the heritability enrichment and standardized effect size (τ ⋆ ) for HC ELEs based on the conservation with the mouse genome (Fig. 1a) and HC ELEs based on the conservation with the context-matched mouse H3K27ac profile, both conditioning on NC ELEs in the same context and 96 known annotations.a Schematic of creating HC ELEs based on the mouse H3K27ac profile.To achieve a fair comparison, HC ELEs based on the mouse genome and mouse H3K27ac profile use the same level of sequence conservation (minMatch=0.9).b Estimates meta-analyzed across 14 contexts and 379 GWAS datasets curated for this study, or a previously defined set of 47 independent GWAS datasets.c Estimates meta-analyzed across 379 GWAS datasets for each of the 14 contexts.The dashed lines have intercept 0 and slope 1. Numerical results of b and c are available in Table S12.The 'Meta-context' results shown in c correspond to the meta-analyzed results of 379 GWAS in b.For b-c, each point denotes an estimate and each error bar denotes ±1.96 SE.   1 thresholds.The false positive rate is defined as the proportion of HC ELEs that do not contain any causal SNP but reach a given significance threshold of P H 1 (i.e., false positives) among all the HC ELEs that do not contain any causal SNP.The false discovery rate is defined as the proportion of false positives among all the HC ELEs with posterior trait-associated probability under the enrichment model for HC ELEs reaching a given threshold (e.g., P H 1 ≥ 0.9).The title of each panel indicates the ground truth of each simulation scenario (Note S2).For each of the 8 simulation scenarios, false positive and discovery rates are computed based on all 200 simulated datasets.The yellow dot in each panel indicates the false positive or discovery rate at P H 1 ≥ 0.9 for each simulation scenario.shown on the vertical axis and its chromosomal position shown on the horizontal axis.The red horizontal dashed line indicates the significance threshold for posterior trait-associated probabilities (P H 1 = 0.9).c Each triangle represents the same trait-associated HC ELE (P H 1 ≥ 0.9) identified by the RSS-NET extension in the given genomic region, and each bar represents a protein-coding gene.If the trait-associated HC ELE is linked to a gene according to a recently published gene regulatory network, then they are connected by a curved edge.The edge color corresponds to the estimated edge weight in the gene regulatory network.For a-c, the green vertical solid line indicates the trait-associated HC ELE (P H 1 ≥ 0.9) identified by the RSS-NET extension, and the blue rectangular area indicates the prioritized effector gene that is likely regulated by the trait-associated HC ELE and is supported by multiple lines of external evidence (Fig. S15).

Figure S1 :
Figure S1: Identification and characterization of omnibus ELEs. a Context-specific ELEs from various contexts (colored bars) are first aggregated and then the overlapping segments are merged into 4 non-overlapping omnibus ELEs (black bars).b Numbers of context-specific and omnibus ELEs without considering sequence conservation (y-axis) and those with sequence conservation (x-axis).c Distributions of context counts for all omnibus ELEs (NC) and the sequence-conserved subsets (LC, MC, HC).d Distributions of context group counts for each NC, LC, MC and HC omnibus ELE. e Percentages of omnibusELEs overlapping with, proximal to, and distal from annotated transcriptional start sites (TSSs).We first determine the center of each ELE, and then we categorize an ELE as (1) "TSS-overlapping" if its center is located within 200 bp of a TSS, (2) "TSSproximal" if its center falls between 200 bp and 2 kb of a TSS, and (3) "TSS-distal" if the genomic distance from its center to the nearest TSS exceeds 2 kb.The annotated TSS list is obtained from UCSC RefGene hg19.f-i Illustration of 4 neural-related HC omnibus ELEs that contain significant genetic associations with BMI (f-g; Fig.s 5c-d and S11-12) and schizophrenia (h-i; Fig.s 6c-d and S13-14).j Color legends for context groups shown in f-i, as well as Fig.s 3 and S3.

Figure S3 :Figure S4 :
Figure S2: Additional results of heritability enrichments for sequence-conserved omnibus ELEs.Estimates metaanalyzed within each of a 16 trait groups for the 108 EUR GWAS datasets and b 9 trait groups for the 312 UK Biobank datasets.Numerical results are available in Table S7.The legend is the same as Fig.s 2a-b.

Figure S5 :
Figure S5: Posterior estimation of enrichment parameters in the RSS-NET extension across 4 simulation scenarios.The x-axis indicates the true hyper-parameter values for each simulation scenario (Note S2).The y-axis indicates the posterior mean estimates (upper: σ; lower: θ).Each box contains results of 200 independent positive datasets for a given scenario.Box plots indicate median (middle line), 25th, 75th percentile (box) and 5th and 95th percentile (whiskers) as well as outliers (single points).Horizontal dashed lines indicate true values of {σ, θ}.
Figure S6: Trade-off between false and true enrichments identified by the RSS-NET extension across 8 simulation scenarios.The title of each panel indicates the ground truth of each simulation scenario.Each panel shows results based on enrichment Bayes factors (BFs) of 200 negative (N) and 200 positive (P) datasets simulated for a given scenario.Each dashed diagonal line has a slope of one and an intercept of zero, indicating the random ROC curve (AUROC=0.5).Higher value of AUROC indicates better performance.Simulation details are provided in Note S2.

Figure S8 :
Figure S8: False positive and discovery rates for 8 simulation scenarios across a wide range of P H

Figure S10 :Figure S11 :
Figure S10: Putative target genes of HC ELEs associated with BMI and schizophrenia.a HC ELEs affect a com-